Materials

First we need to get data. There aren’t many primatology datasets available, but I found some:

  1. The evolution of primate general and cultural intelligence
  2. Gombe chimpanzee personality

Nice: we have one dataset about tool use (multiple species) and one about behaviour (a single chimpanzee population). Two very different problems, but both can be approached with the same toolkit of methods we just described.


Objectives

Learn what a dimensionality reduction technique is: when to use one, its general purpose, and its practical applications.

  1. exploratory data analysis → Data visualization
  2. feature extraction → Data mining
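As a minimal, self-contained illustration of the idea (on R’s built-in `iris` data, not our primate datasets): a PCA compresses four correlated measurements into a couple of components you can actually plot.

```r
# PCA on iris's 4 numeric columns: center, scale, then rotate
pc <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pc)        # proportion of variance explained by each component
head(pc$x[, 1:2])  # 2-D coordinates, ready for plotting
```

Most of the variance typically ends up in the first one or two components, which is exactly what makes 2-D scatterplots of high-dimensional data possible.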

Problem 1: Primate cognition dataset

library(ggfortify)     # autoplot() method for prcomp objects
library(uwot)          # umap()
library(randomForest)  # random forests, na.roughfix(), rfImpute() (Problem 2)
library(magrittr)      # the %>% pipe

toolsRAW <- read.csv("./Data_ReaderHagerLalandPhilTrans2011.csv")
tools_data <- toolsRAW[,3:13]
str(tools_data) 
## 'data.frame':    238 obs. of  11 variables:
##  $ SpeciesPurvis                : Factor w/ 238 levels "Allenopithecus nigroviridis",..: 22 23 24 25 26 46 47 48 49 50 ...
##  $ GenusPurvis                  : Factor w/ 58 levels "Allenopithecus",..: 6 6 6 6 6 11 12 12 12 12 ...
##  $ Taxon                        : Factor w/ 3 levels "Prosimian","Simian",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Great.ape                    : Factor w/ 2 levels "Great ape","No": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Innovation                   : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Tool.use                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Extractive.foraging          : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ Social.learning              : int  0 0 0 2 0 0 0 0 0 0 ...
##  $ Innovation..Reduced.         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Tool.use..Reduced.           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Extractive.foraging..Reduced.: int  0 0 0 0 0 0 0 0 0 0 ...

Question 1: How many possible ways are there to “visualize” the numerical variables?

pairs(tools_data[,8:11])

toolsPC <- prcomp(tools_data[,8:11], center = TRUE, scale. = TRUE)
autoplot(toolsPC, data = tools_data, shape = FALSE, loadings = TRUE,
         loadings.label = TRUE, colour = "Taxon")

Question 2: Which species is observation 43? What about 49 and 42?

Great apes are just too different. Hover over the points with your cursor to see for yourself.

vizier::embed_plotly(toolsPC$x, tools_data$GenusPurvis)

What if we remove them from the equation?

no_ga <- tools_data[tools_data$Great.ape == "No",] # remove great apes
no_gaPC <- prcomp(no_ga[,8:11], center = TRUE, scale. = TRUE)
autoplot(no_gaPC, data = no_ga, shape = FALSE, loadings = TRUE,
         loadings.label = TRUE, colour = "Taxon")

“Zooming in” on the rest of the data creates new outliers again… It’s as if there were an almost “fractaloid” jump in cognition. Or a literature bias towards intensively studying the same few species.

Question 3: What species are 18… and 29? Notice the directions of the loadings.

vizier::embed_plotly(no_gaPC$x, no_ga$GenusPurvis)

We keep running into the same problem. Is there a way to change the spread of the data so that we can get a good look at all species?
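One classic option (a hypothetical alternative, not the route taken here) is to compress heavy-tailed counts with a log transform before running PCA, so a few prolific species no longer dominate the axes:

```r
# log1p keeps zeros at zero but shrinks the gap between outliers and the rest
counts <- c(0, 1, 2, 5, 40, 300)
log1p(counts)
```

This tames the spread, but at the cost of making the loadings harder to interpret on the original count scale.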

What about using UMAP?

pCoo <- umap(tools_data[,8:11], verbose = TRUE, n_neighbors = 2, spread = 10)
## 09:44:17 Read 238 rows and found 4 numeric columns
## 09:44:17 Using FNN for neighbor search, n_neighbors = 2
## 09:44:18 Commencing smooth kNN distance calibration using 2 threads
## 09:44:18 Found 13 connected components, falling back to 'spca' initialization
## 09:44:18 Initializing from PCA
## 09:44:18 PCA: 2 components explained 99.06% variance
## 09:44:18 Commencing optimization for 500 epochs, with 450 positive edges
## 09:44:18 Optimization finished
vizier::embed_plotly(pCoo, tools_data$GenusPurvis, title = "Cognition in Primates UMAP")

Problem 2: Chimpanzee Personality

### LOAD DATA

chimpRaw <- read.csv("gombe_460.csv")
str(chimpRaw)
## 'data.frame':    460 obs. of  73 variables:
##  $ chimpcode        : Factor w/ 128 levels "A100","A341",..: 53 41 103 75 93 119 91 43 16 114 ...
##  $ ratercode        : Factor w/ 18 levels "A","B","C","E",..: 11 11 5 5 5 5 5 5 5 5 ...
##  $ sex              : int  1 1 0 1 0 0 0 0 1 0 ...
##  $ community        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ chimp_YOB        : int  1972 1977 1969 1969 1969 1969 1970 1970 1971 1970 ...
##  $ month            : int  10 10 10 10 10 10 10 10 10 10 ...
##  $ day              : int  11 11 13 13 13 13 13 13 13 13 ...
##  $ year             : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ rater_start_year : int  1987 1987 1980 1980 1980 1980 1980 1980 1980 1980 ...
##  $ rater_end_year   : int  2007 2007 2009 2009 2009 2009 2009 2009 2009 2009 ...
##  $ chimp_start_year : int  1972 1977 1969 1973 1969 1983 1970 1970 1971 1992 ...
##  $ chimp_end_year   : int  2012 2012 1983 2002 1994 2007 2003 2012 2012 2010 ...
##  $ first_expr       : int  1987 1987 1980 1980 1980 1983 1980 1980 1980 1992 ...
##  $ last_expr        : int  2007 2007 1983 2002 1994 2007 2003 2009 2009 2009 ...
##  $ hpq_midpoint_yr  : num  1997 1997 1982 1991 1987 ...
##  $ hpq_age_firstexpr: int  15 10 11 11 11 14 10 10 9 22 ...
##  $ hpq_age_lastexpr : int  35 30 14 33 25 38 33 39 38 39 ...
##  $ hpq_midpoint_age : num  25 20 12.5 22 18 26 21.5 24.5 23.5 30.5 ...
##  $ dom.raw          : int  7 5 1 7 6 7 1 7 7 4 ...
##  $ sol.raw          : int  2 1 NA 4 7 7 7 2 4 7 ...
##  $ impl.raw         : int  2 2 3 4 7 6 4 4 7 7 ...
##  $ symp.raw         : int  7 3 4 5 4 4 4 6 7 4 ...
##  $ stbl.raw         : int  1 2 6 7 4 7 6 6 7 7 ...
##  $ invt.raw         : int  6 5 3 4 6 5 2 5 7 4 ...
##  $ depd.raw         : int  6 4 4 7 4 4 1 5 7 4 ...
##  $ soc.raw          : int  5 5 4 7 4 4 1 7 7 7 ...
##  $ thotl.raw        : int  5 1 3 4 2 3 4 4 4 4 ...
##  $ help.raw         : int  5 5 4 7 4 5 1 7 7 5 ...
##  $ exct.raw         : int  7 6 2 6 6 4 2 5 6 5 ...
##  $ inqs.raw         : int  5 5 3 4 4 4 2 5 4 4 ...
##  $ decs.raw         : int  6 6 6 7 7 7 7 7 7 7 ...
##  $ indv.raw         : int  2 2 7 5 7 7 7 7 7 6 ...
##  $ reckl.raw        : int  3 5 4 7 4 1 1 4 7 2 ...
##  $ sens.raw         : int  5 7 4 7 6 6 7 6 7 5 ...
##  $ unem.raw         : int  2 1 6 4 4 4 4 5 7 4 ...
##  $ cur.raw          : int  6 5 6 4 3 5 1 6 4 4 ...
##  $ vuln.raw         : int  3 2 7 5 7 7 5 7 5 6 ...
##  $ actv.raw         : int  5 7 7 7 4 5 6 5 4 7 ...
##  $ pred.raw         : int  7 6 5 7 6 4 4 5 5 4 ...
##  $ conv.raw         : int  5 1 5 5 7 7 1 4 4 5 ...
##  $ cool.raw         : int  5 3 4 4 7 7 4 7 7 5 ...
##  $ innov.raw        : int  5 5 3 7 3 4 2 6 7 4 ...
##  $ n_hpq_missing    : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ dom.meansub      : num  7 5 1 7 6 7 1 7 7 4 ...
##  $ sol.meansub      : num  2 1 3.36 4 7 ...
##  $ impl.meansub     : num  2 2 3 4 7 6 4 4 7 7 ...
##  $ symp.meansub     : int  7 3 4 5 4 4 4 6 7 4 ...
##  $ stbl.meansub     : int  1 2 6 7 4 7 6 6 7 7 ...
##  $ invt.meansub     : num  6 5 3 4 6 5 2 5 7 4 ...
##  $ depd.meansub     : num  6 4 4 7 4 4 1 5 7 4 ...
##  $ soc.meansub      : num  5 5 4 7 4 4 1 7 7 7 ...
##  $ thotl.meansub    : num  5 1 3 4 2 3 4 4 4 4 ...
##  $ help.meansub     : num  5 5 4 7 4 5 1 7 7 5 ...
##  $ exct.meansub     : num  7 6 2 6 6 4 2 5 6 5 ...
##  $ inqs.meansub     : num  5 5 3 4 4 4 2 5 4 4 ...
##  $ decs.meansub     : num  6 6 6 7 7 7 7 7 7 7 ...
##  $ indv.meansub     : int  2 2 7 5 7 7 7 7 7 6 ...
##  $ reckl.meansub    : num  3 5 4 7 4 1 1 4 7 2 ...
##  $ sens.meansub     : num  5 7 4 7 6 6 7 6 7 5 ...
##  $ unem.meansub     : num  2 1 6 4 4 4 4 5 7 4 ...
##  $ cur.meansub      : num  6 5 6 4 3 5 1 6 4 4 ...
##  $ vuln.meansub     : num  3 2 7 5 7 7 5 7 5 6 ...
##  $ actv.meansub     : num  5 7 7 7 4 5 6 5 4 7 ...
##  $ pred.meansub     : num  7 6 5 7 6 4 4 5 5 4 ...
##  $ conv.meansub     : int  5 1 5 5 7 7 1 4 4 5 ...
##  $ cool.meansub     : num  5 3 4 4 7 7 4 7 7 5 ...
##  $ innov.meansub    : num  5 5 3 7 3 4 2 6 7 4 ...
##  $ chimpcode.1      : Factor w/ 128 levels "A100","A341",..: 53 41 103 75 93 119 91 43 16 114 ...
##  $ dominance        : num  5 5 3.67 5 5.67 ...
##  $ extraversion     : num  5.5 6.25 4.16 5.25 2.5 ...
##  $ conscientiousness: num  6 5 4.67 4 3.67 ...
##  $ agreeableness    : num  5.67 5 4 6.33 4.67 ...
##  $ neuroticism      : num  7 6 2 3.5 5 2.5 2 3.5 3.5 3 ...
chimpPerson <- chimpRaw[,c(19:42)] # only keeping the raw personality traits to build the unsupervised model

Free trick: NA removal using an unsupervised randomForest combined with cmdscale and kmeans

Estimating missing values

### NA REMOVAL

set.seed(1992)
pre_rf <- randomForest(chimpPerson %>% na.roughfix, proximity = TRUE)
personalityCoordinates <- (1 - pre_rf$proximity) %>% cmdscale()
cl <- kmeans(personalityCoordinates, 5) # I arbitrarily picked 5; look at your plot to decide
vizier::embed_plot(personalityCoordinates, cl$cluster)
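Rather than picking k purely by eye, a quick elbow check is one option; here is a sketch on built-in `iris` (as a stand-in for `personalityCoordinates`):

```r
# Total within-cluster sum of squares for k = 1..8; look for the "elbow"
set.seed(1992)
wss <- sapply(1:8, function(k) kmeans(iris[, 1:4], centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "k (number of clusters)", ylab = "Total within-cluster SS")
```

The curve drops steeply up to the “true” cluster count and flattens afterwards; the bend is your candidate k.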

new_x <- rfImpute(chimpPerson, cl$cluster %>% factor)[,-1] # imputes (fills in) any NA in your dataset; quite a robust method
## ntree      OOB      1      2      3      4      5
##   300:  17.39% 17.78% 45.76% 28.12%  5.61% 10.94%
## ntree      OOB      1      2      3      4      5
##   300:  18.04% 24.44% 47.46% 27.08%  5.10% 12.50%
## ntree      OOB      1      2      3      4      5
##   300:  18.91% 28.89% 40.68% 29.17%  5.61% 17.19%
## ntree      OOB      1      2      3      4      5
##   300:  17.39% 20.00% 42.37% 25.00%  6.63% 14.06%
## ntree      OOB      1      2      3      4      5
##   300:  18.48% 24.44% 45.76% 30.21%  5.10% 12.50%
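For intuition: `na.roughfix` above simply fills each numeric NA with its column median before the forest ever sees the data. A base-R sketch of the same idea (`roughfix` here is a hypothetical helper, not the package function):

```r
# Median imputation, column by column (what na.roughfix does for numeric data)
roughfix <- function(df) {
  as.data.frame(lapply(df, function(col) {
    col[is.na(col)] <- median(col, na.rm = TRUE)
    col
  }))
}
toy <- data.frame(a = c(1, NA, 3), b = c(NA, 2, 2))
roughfix(toy)  # a: 1 2 3; b: 2 2 2
```

`rfImpute` then refines these rough values using the forest’s proximities, which is why it tends to beat plain median imputation.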

UNSUPERVISED RANDOM FOREST

Now that we have created stand-ins for the missing values, let’s re-fit the random forest.

set.seed(1992)
rf <- randomForest(new_x, proximity = TRUE)
personalityCoo <- (1 - rf$proximity) %>% cmdscale()

# calculate the clusters using k-means.

cl <- kmeans(personalityCoo, 4)
vizier::embed_plot(personalityCoo, cl$cluster)

vizier::embed_plot(personalityCoo, chimpRaw$chimpcode) # doesn't really seem discriminative, but this dataset has 128 individuals and only 460 observations!

new_df <- cbind(chimpRaw, data.frame(personalityCoo))
# X1 and X2 are the two components of the model generated just above; they work as coordinates for plotting personality
# note: with randomForest, the formula response `X1 + X2` regresses their *sum* on the five summary traits
indModel <- randomForest(X1 + X2 ~ dominance + extraversion + conscientiousness + agreeableness + neuroticism, data = new_df) # these are the variables I left out at the beginning, to check whether X1 and X2 really capture personality
varImpPlot(indModel, main = "Most important traits for creating the personality variables")

ggplot(data = new_df, aes(x = X1, y = X2)) +
  geom_point(aes(color = as.factor(sex), shape = as.factor(sex)), alpha = 0.7) +
  labs(x = "First Component", y = "Second Component", color = "Sex", shape = "Sex") +
  guides(color = guide_legend(), shape = guide_legend()) + theme_minimal()

# There is still a lot of overlap. But again, this dataset is not really optimal.

# I'd say it's impossible to identify individuals from the observed personality traits,
# at least with this dataset and observation protocol. Still, let's have a look

vizier::embed_plot(personalityCoo, new_df$chimpcode)

Yep. It’s a mess. But with a dataset with far fewer individuals and more observations per individual, it might be possible.

However, let’s not give up right away on estimating sex from personality traits. Remember that the randomForest we used was fully unsupervised.
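To see what supervision buys you, here is a sketch of a fully supervised forest on built-in `iris` (a stand-in for our traits, with `Species` playing the role of the label):

```r
library(randomForest)
set.seed(1992)
# A supervised forest reports an honest out-of-bag (OOB) error estimate for free
sup <- randomForest(Species ~ ., data = iris)
sup$err.rate[nrow(sup$err.rate), "OOB"]  # OOB misclassification rate
```

The OOB estimate is computed on trees that never saw each observation, so it is a built-in cross-validation of sorts.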

Obviously UMAP cannot solve the individualization problem; it’s just too hard, and we do not have enough data. We are in fact dealing with an extreme case of the “curse of dimensionality”.

library(uwot)
pCoo <- umap(new_x, scale = TRUE, verbose = TRUE, y = new_df$chimpcode)
## 09:44:25 Read 460 rows and found 24 numeric columns
## 09:44:25 Scaling to zero mean and unit variance
## 09:44:25 Kept 24 non-zero-variance columns
## 09:44:25 Using FNN for neighbor search, n_neighbors = 15
## 09:44:25 Commencing smooth kNN distance calibration using 2 threads
## 09:44:25 Processing y data
## 09:44:25 Carrying out categorical intersection for 1 column
## 09:44:25 Applying categorical set intersection, weight = 0.5 far distance = 5
## 09:44:26 Initializing from normalized Laplacian + noise
## 09:44:26 Commencing optimization for 500 epochs, with 10380 positive edges
## 09:44:27 Optimization finished
vizier::embed_plot(pCoo, as.factor(new_df$chimpcode), title = "ID UMAP")

But what about separating the two sex classes…?

set.seed(1992)
pCooS <- umap(new_x, scale = TRUE, verbose = TRUE, y = as.factor(new_df$sex), spread = 16)
## 09:44:28 Read 460 rows and found 24 numeric columns
## 09:44:28 Scaling to zero mean and unit variance
## 09:44:28 Kept 24 non-zero-variance columns
## 09:44:28 Using FNN for neighbor search, n_neighbors = 15
## 09:44:28 Commencing smooth kNN distance calibration using 2 threads
## 09:44:28 Processing y data
## 09:44:28 Carrying out categorical intersection for 1 column
## 09:44:28 Applying categorical set intersection, weight = 0.5 far distance = 5
## 09:44:28 Initializing from normalized Laplacian + noise
## 09:44:28 Commencing optimization for 500 epochs, with 9280 positive edges
## 09:44:29 Optimization finished
vizier::embed_plot(pCooS, as.factor(new_df$sex), title = "Chimpanzee Sex UMAP")

Yeah right. Perfect. Too perfect. Better than estimating sex from pelvic measurements in human bones. That just can’t be right.

Let’s validate the algorithm properly with a train/test split.

## 75% of the sample size
final_x <- cbind(new_df$sex, new_x)
smp_size <- floor(0.75 * nrow(final_x))

## set the seed to make your partition reproducible
set.seed(1992)
train_ind <- sample(seq_len(nrow(final_x)), size = smp_size)

train <- final_x[train_ind, ]
test <- final_x[-train_ind, ]
set.seed(1992)
chimpTRAIN <- umap(train[,-1], scale = TRUE, verbose = TRUE, y = as.factor(train[,1]), spread = 10, ret_model = TRUE)
## 09:44:29 Read 345 rows and found 24 numeric columns
## 09:44:29 Scaling to zero mean and unit variance
## 09:44:29 Kept 24 non-zero-variance columns
## 09:44:29 Using Annoy for neighbor search, n_neighbors = 15
## 09:44:30 Building Annoy index with metric = euclidean, n_trees = 50
## 0%   10   20   30   40   50   60   70   80   90   100%
## [----|----|----|----|----|----|----|----|----|----|
## **************************************************|
## 09:44:30 Searching Annoy index using 2 threads, search_k = 1500
## 09:44:30 Annoy recall = 100%
## 09:44:30 Commencing smooth kNN distance calibration using 2 threads
## 09:44:30 Processing y data
## 09:44:30 Carrying out categorical intersection for 1 column
## 09:44:30 Applying categorical set intersection, weight = 0.5 far distance = 5
## 09:44:30 Initializing from normalized Laplacian + noise
## 09:44:30 Commencing optimization for 500 epochs, with 7114 positive edges
## 09:44:31 Optimization finished
vizier::embed_plotly(chimpTRAIN$embedding, as.factor(train[,1]),
                   title = "Chimpanzee Sex UMAP (semi-supervised)")
set.seed(1992)
cadTEST <- umap_transform(test[,-1], chimpTRAIN)
vizier::embed_plotly(cadTEST, as.factor(test[,1]),
                   title = "Chimpanzee Sex UMAP (semi-supervised) test")